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We consider an independence feature screening technique for iden¬ 
tifying explanatory variables that locally contribute to the response 
variable in high-dimensional regression analysis. Without requiring 
a specific parametric form of the underlying data model, our ap¬ 
proach accommodates a wide spectrum of nonparametric and semi- 
parametric model families. To detect the local contributions of ex¬ 
planatory variables, our approach constructs empirical likelihood lo¬ 
cally in conjunction with marginal nonparametric regressions. Since 
our approach actually requires no estimation, it is advantageous in 
scenarios such as the single-index models where even specification 
and identification of a marginal model is an issue. By automatically 
incorporating the level of variation of the nonparametric regression 
and directly assessing the strength of data evidence supporting local 
contribution from each explanatory variable, our approach provides a 
unique perspective for solving feature screening problems. Theoreti¬ 
cal analysis shows that our approach can handle data dimensionality 
growing exponentially with the sample size. With extensive theo¬ 
retical illustrations and numerical examples, we show that the local 
independence screening approach performs promisingly. 
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1. Introduction. High-dimensional data are becoming increasingly avail¬ 
able, and they have triggered surging investigation and development of new 
theory and methods; see Hastie, Tibshirani and Friedman (2009), Fan and Lv 
(2010) and Biihlrnarm and van de Geer (2011) for overview and discussions. 
Independence feature screening is a class of rapidly developing approaches 
that is particularly useful in preliminary analysis for preprocessing data to 
reduce the scale of high-dimensional statistical problems; see, among others, 
Fan and Lv (2008) and Fan and Song (2010) for the independence screen¬ 
ing methods for linear and generalized linear models, Mai and Zou (2013) 
for variable screening in classification problems, Zhu et al. (2011), Li et al. 
(2012) and Li, Zhong and Zhu (2012) for feature screening methods using 
more general types of correlations. 

Broadly speaking, independence feature screening methods rely on rank¬ 
ing estimations measuring the marginal contributions of explanatory vari¬ 
ables. For example, Fan and Lv (2008) and Fan and Song (2010) consider 
ranking magnitudes of marginal estimators under some parametric models. 
In linear and generalized linear models, marginal estimator-based ranking 
can be viewed as equivalent to marginal correlation based ranking [Fan and 
Lv (2008), Chang, Tang and Wu (2013a)]. Various generalized versions of 
the correlation and conditional correlation are also considered as ranking 
criteria [Zhu et al. (2011), Li et al. (2012), Liu, Li and Wu (2014), Fan, 
Ma and Dai (2014)]. Recently, Fan, Feng and Song (2011) consider a rank¬ 
ing measure based on aggregating local contributions from an explanatory 
variable in a framework of nonparametric additive models using marginal 
penalized splines approach. In classification problems, Mai and Zou (2013) 
propose to use the so-called Kolmogrov filter to construct a ranking criterion 
based on aggregating sample distributional discrepancies between the two 
groups of interest at all observed values of a predictor. 

We consider in this paper an independence feature screening method for 
a general class of regression problems covering the nonparametric additive 
models, semiparametric single-index models and multiple-index models, and 
varying coefficient models as special cases. There are two building blocks for 
constructing the screening criterion in our approach. The first one is the non¬ 
parametric regression applied marginally on one explanatory variable at a 
time. For overview of nonparametric regression methods, see Fan and Gijbels 
(1996) and Hardle (1990). The second building block is empirical likelihood 
[Owen (1988)] constructed locally for the marginal nonparametric regression. 
Instead of acquiring some marginal estimators, our approach is capable of 
objectively and conveniently assessing the strength of data evidence for test¬ 
ing the local contributions of a given explanatory variable. Moreover, as has 
been noted in the literature, an independence feature screening procedure 
may miss explanatory variables that are marginally unrelated but jointly 
related to the response [Fan and Lv (2008)]. To address this issue, many 
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iterative versions of feature screening methods have been proposed. We bor¬ 
row the idea of Zhu et al. (2011) and propose an iterative version of our 
local independence feature screening procedure. 

Our study carries innovative contributions from a few aspects. First and 
foremost, the perspective of our approach is unique compared with other 
existing ones. Our approach directly targets at quantifying the strength of 
data evidence against the null hypothesis that explanatory variables are not 
locally contributing to the response variable. Hence, it actually requires no 
estimation. Moreover, as shown in our theoretical analysis, the fundamen¬ 
tal statistic in our approach is self-Studentized, automatically incorporat¬ 
ing variance of the marginal statistical approach. All existing approaches 
for nonparametric and semiparametric models require estimating marginal 
contributions and incorporate no effect from the level of variations of the 
marginal estimators. As a consequence, ranking the nonstandardized magni¬ 
tudes of the marginal estimators may not best reflect the marginal contribu¬ 
tions from predictors. Additionally, there may be difficulties when identifying 
the marginal effect becomes an issue, for example, in single- and multiple- 
index models. We show in our numerical examples that our approach out¬ 
performs others especially when the signal of the marginal contribution is 
weak, and when the variation of the response variable is more complex, all 
thanks to the unique perspective of the proposed marginal empirical likeli¬ 
hood approach. Second, existing approaches are typically investigated within 
specific families of models while our approach targets at detecting generic 
local contributions to the response variable from an explanatory variable. 
Thus, our method is suitable for capturing more general nonlinear effects in 
explanatory variables for solving a broad range of high-dimensional prob¬ 
lems. Our theoretical analysis establishes the validity for feature screening 
in a general and broad setup, allowing data dimensionality to grow expo¬ 
nentially with the sample size, and our numerical and real data examples 
demonstrate that our method performs very promisingly. 

Our investigation also contributes in solving challenging empirical like¬ 
lihood problems for high-dimensional nonparametric and semiparametric 
statistical problems. In existing literature, much effort has been devoted 
into extending the empirical likelihood of Owen (1988, 1991) for parametric 
models to nonparametric and semiparametric models; see, among others, 
Chen (1996), Chen and Qin (2000) and the review in Chen and Van Keile- 
gom (2009). For high-dimensional data, it remains open for solving empirical 
likelihood problems in nonparametric and semiparametric scenarios where 
merits such as robustness and other nonparametric features are highly de¬ 
sirable [Hjort, McKeague and Van Keilegom (2009), Chen, Peng and Qin 
(2009), Tang and Leng (2010), Leng and Tang (2012), Chang, Chen and 
Chen (2015)]. Recently, Chang, Tang and Wu (2013a) investigate marginal 
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empirical likelihood for general high-dimensional parametric models spec¬ 
ified by the estimating equations. Nevertheless, studying high-dimensional 
empirical likelihood beyond parametric models remains open because formu¬ 
lating and characterizing the empirical likelihood locally itself is known to 
be an important and difficult problem [Chen and Van Keilegonr (2009)]. Our 
study on the local feature screening procedure solves the problem of con¬ 
structing and characterizing empirical likelihood locally, which ideally fits a 
broad class of nonparametric and semiparametric models. Additionally, for 
summarizing the contribution from one specific predictor, we propose and 
justify an approach for aggregating the data evidence for local contributions, 
which in turn delivers the validity of our feature screening procedure. Re¬ 
markably, our approach can handle exponential data dimensionality even in 
the nonparametric and semiparametric settings where the convergence rate 
of nonparametric kernel regression is known to be slower [Fan and Gijbels 
(1996), Hardle (1990)]. 

The rest of the paper is organized as follows. Section 2 introduces the 
methodology of independence feature screening for nonparametric models 
and presents the corresponding theoretical properties. In Section 3, we ap¬ 
ply this unified screening approach to deal with problems in nonparametric 
additive models, single-index models and multiple-index models, and varying 
coefficient models. As a methodological extension, we outline the iterative 
version of our unified screening approach in Section 4. Our simulation studies 
in Section 5 demonstrate the effectiveness of this method. We conclude with 
a discussion in Section 6, and relegate a real data analysis and the proofs to 
the supplementary file of this paper [Chang, Tang and Wu (2016)]. 

2. Main results. 

2.1. Methods. Suppose that we have a random sample {(Xj, Y))}” =1 from 
the data model 

(2.1) Y = m(X) + e, 

where X = (X\ ,..., X p ) T and £ is the random error with E(e|X) =0. In our 
study, no specification of m(X) is required. The data dimensionality p of the 
explanatory variable vector X can grow exponentially with the sample size 
n, but the true model is very sparse in the sense that there is only a small 
fraction of the explanatory variables contributing to the response variable 
Y. Let A4* = {1 < j < p : E(Y|X) varies with the value of Xj}, and we call 
the variables indexed by A4* contributing explanatory variables. Without 
loss of generality, we assume that E(Y) = 0 implying that E{m(X)} = 0. 

Since X is high-dimensional and without any prior information on which 
of them are contributing in explaining Y, a natural idea is to investigate 
the marginal contribution from each explanatory variable in explaining Y 
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to justify whether it is relevant. For such a purpose, we consider marginal 
nonparametric regression problems: 

(2.2) min E[{V - fj(Xj)} 2 ] (j = 1,... ,p), 

Jj 


where Jz ?2 denotes the class of square integrable functions. Note that 'K(Y\Xj) 
is the minimizer of (2.2). Naturally, we use fj(x) = E(Y\Xj = x) to evaluate 
the marginal contribution of Xj locally at Xj = x. If an explanatory variable 
Xj is not contributing to Y marginally, then fj(x) = 0 for all x € X . Here, 
X is the support of Xj . This motivates us to investigate a feature screening 
procedure by assessing whether E(Y\Xj) = 0 or not for each j = 1 ,... ,p. 
However, how to develop a nice way for summarizing the impact due to Xj 
is not straightforward due to the fact that fj(x) is a function of x so that 
one needs to assess fj(x) for all values over the support of Xj. 

We consider the Nadaraya-Watson (NW) estimator for fj(x): 


(2.3) 


_ n- 1 Zti^h(X ij -x)Y i 

"- 1 ££=!**(*« -x) 


where Xh( u ) = for some kernel function /C and h is the bandwidth. 

For capturing the marginal variable effect, the choice of the NW estimator 
does not compromise the general applicability of the marginal empirical 
likelihood with other nonparametric approaches, for example, the local linear 
estimator [Fan and Gijbels (1996)], etc. Intuitively, fj(x) should be small 
for all x € X if Xj does not marginally contribute to Y. 

Empirical likelihood [Owen (1988, 2001)] is an influential nonparametric 
likelihood approach. Without requiring to assume full parametric distribu¬ 
tions, empirical likelihood shares some desirable merits of the conventional 
likelihood such as ^-distributed likelihood ratios and Bartlett correctabil- 
ity; see Chen and Van Keilegom (2009) for a review. For assessing fj{x) = 0 
at a given x without distributional assumptions, we construct the following 
empirical likelihood: 


(2.4) 


ELj(x,0) 


: Sup l n W i '■ W i ^ Wi = ^ W i^h{Xjj — x)Yi =0 


. i =1 


1=1 


2=1 


By applying the Lagrange multiplier method for solving (2.4), we obtain the 
empirical likelihood ratio: 

£j(x, 0) = —21og{ELj(x,0)} — 2nlogn 

n 

= 2]Tlog{l + A X h (X ij -x)Y i }, 

2=1 


(2.5) 
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where A is the univariate Lagrange multiplier solving 0 = Ya =i i+\ic h (Xi -x)Y • 
Since the denominator in (2.3) converges to the density of Xj evaluated at 
x, a large value of £j(x, 0) is taken as evidence against fj(x) = 0 provided 
that the density of Xj is bounded away from 0 at x. Hence, £j(x, 0) is indeed 
a statistic for testing whether or not the numerator in (2.3) has zero mean 
locally at x. If 0 is not in the convex hull of — x)li}” =1 , we define 

tj (x. 0) = oo as a strong evidence of the local contribution from Xj. 

For assessing E(Y\Xj) = 0, we propose to use 

(2.6) £j( 0) = sup £j(x, 0) 

X(zX n 

for each j = 1 where X n is a partition of the support X into sev¬ 

eral intervals. For feature screening purpose, we propose selecting the set of 
explanatory variables by 

(2.7) M ln = {l<j<p:£j(0)>ln}- 

We call our method a local independence feature screening approach by ob¬ 
serving that in (2.4) it is the local correlation evaluated at x between Xj via 
lCh(Xj — x) and Y is assessed. Such a notion is seen as extending the initial 
proposal of independence feature screening using correlations as in Fan and 
Lv (2008). As the empirical likelihood ratio £j(x, 0) is self-Studentized, the 
statistic £j( 0) is robust to the heterogeneous cases. Compared with L 2 -type 
screening statistics, the Loo-type screening statistic £j{ 0) is seen to be more 
suitable when K(Y\Xj = x) is away from zero locally at some x instead of 
globally. We will give the specification of 7 n later in Theorem 1 under which 
the proposed approach has the sure screening property, that is, capable of 
identifying the set of contributing explanatory variables. However, choosing 
7 n is generally hard in practice. Thus, following the convention of exist¬ 
ing screening methods, we suggest running procedure to preprocess data by 
selecting a prespecified number of variables. 

Practically implementing the proposed method is convenient. In practice, 
a natural choice is to evaluate the statistic (2.6) by £j( 0) = maxi<j< n £j {Xij, 0), 
where {Xjj}f =l are the n observations of the jth. explanatory variable. Since 
evaluating £j(x, 0) only involves univariate optimization when solving (2.5) 
using the Lagrange multiplier method, the screening statistics can be carried 
out easily by the existing algorithms. As for the bandwidth h in the marginal 
NW estimator (2.3), we note that conventional bandwidth selection meth¬ 
ods such as cross-validation and the reference rule [Fan and Gijbels (1996)] 
can be applied. Our theory in the next section demonstrates the validity of 
the variable screening procedure for a range of bandwidth applied in (2.3) 
including ones selected by methods like cross-validation and the reference 
rule. Our numerical examples also show that the approach implemented with 
bandwidth selected by the cross-validation performs satisfactorily. 
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2.2. Theoretical properties. Throughout this paper, we use || • ||2 and 
|| • ||oo to denote the L 2 -norm and sup-norm, respectively, and C r (l) de¬ 
notes the class of all continuous functions defined over I that are r times 
differentiable. We assume the following conditions. 

(A.l) The marginal projections {,/j}y =1 belong to C r (X). If r = 0, s 
satisfy the Lipschitz condition with order a € (0,1], that is | fj(s) — fj(t) \ < 
K\\s — t\ a for any s,t € X, where K\ is a positive constant uniformly for any 
j = 1,.. .,p. In addition, there exists a constant K 2 such that |/j r ^(x')| < K 2 
for any x € X and j = 1,... ,p. 

(A.2) The marginal density function gj of Xj satisfies 0 < K% < gj(x) < 
K 4 < 00 on X for j = 1,... ,p. In addition, we assume that each gj belongs 
to C r (X) for the r given in (A.l) and \g ( ' r \x)\ < K§ for any x € X and 
j = 1 

(A.3) There exist nonnegative constants ci > 0 and k € [0, 
such that minjg^,, ||/jHoo > cin _K , where r and a are specified in (A.l). 

(A.4) Let \\X n \\ be the largest length of the intervals in the partition X n , 
and there exists some positive constant £ such that \\X n \\ = n~^. 

(A.5) There exist positive constants Kq, K 7 and 7 such that P(|y| >u)< 
Kq exp(— Kju 1 ) for any u > 0 . 

(A. 6 ) For r specified in (A.l), if r > 1, the kernel function /C(-) is of 
order r, that is, f JC(u) du = 1, f u k K,{u) du = 0 for k = 1 ,..., r — 1 and 
f u r /C(u)du >0. If r = 0, the kernel function satisfies JC(u) > 0 for any u 
and f /C(u) du = 1 . 

Here, (A.l) is a general condition describing the continuity of each fj(x) = 
E(F"| Xj = x). If the first derivation of each fj exists, then r > 1 and a = 
1. Assumption (A. 2 ) is standard for kernel regression implying that the 
density of Xj does not vanish on its support; see, for example, Hardle (1990), 
and it implies bounded support of the explanatory variables. For ease in 
presentation, we take the same support X = [a, b] for all Xj which can be 
easily satisfied because some location-scale transformation can always be 
applied in practice if otherwise. The condition in (A.3) is for identifying 
A4*, which requires that the minimal signal strength measured by ||/j||oo 
cannot be too weak. The restriction of the minimal signal strength depends 
on the continuity of fj via r. The smoother ffs are, the weaker the condition 
on the signal strength is required, and the minimal signal strength cannot 
vanish at a rate faster than nA 1 / 2 . Assumption (A.4) regularizes the partition 
of the support X to be of size at least 0(n^). Assumption (A.5) on the tail 
distribution of the response variable is a conventional technical requirement 
for Cramer-type large deviations. For example, 7 = 2 if the response variable 
Y is a normal or sub-Gaussian distribution, and 7 = 00 if Y has a compact 
support. Assumption (A. 6 ) specifies the requirement for the kernel function 
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so that the bias due to kernel smoothing is not dominating; see Muller (1987) 
for more detail about higher order kernel functions. 

For the parameters r, a and 7 specified in above assumptions, let 

(2.8) Q\ = max{r, a}, Q2 = max{7, 2} + 2 and <5 = maxi-- 1,01. 

l7 J 

The parameter £1 characterizes the continuity of the marginal projections 
and densities. Parameters Q 2 and 5 are related to the tail probabilistic be¬ 
havior of response variable Y. Meanwhile, we assume that the bandwidth 
h used in (2.4) satisfies h x n~ w for some positive w whose specification is 
discussed later. 


Proposition 1 . Under assumptions (A.1)-(A.6), pick we 1) and 
f > k + 2w, then there exists a uniform constant C\ such that for any j € A4* 
and L — >• 00, 


*i(0)< 


^kW~ 2k ~ 2w 


2L 2 


< exp(— C\LU) 


+ exp(-Cin min{1 " 2K - u '’ (1 -' c -"' )/(1+5)} ). 


Proposition 1 gives a uniform result for all explanatory variables con¬ 
tributing in the true model. The maximum distance between the adjacent 
two points that are used to construct our procedure should be o(n~ K ~ 2w ). 
Specifically, with large probability and uniformly for all j E A4*, the diverg¬ 
ing rate of l 3 (0) is not slower than n x ~ 2K ~ 2w L~ 2 . If j ^ Ad*, that is, the 
explanatory variable Xj does not have the marginal contribution to Y (i.e., 
fj = 0), following the argument in Owen (1988) and Chang, Tang and Wu 
(2013a), it can be shown that the corresponding 0) is O p (l). Hence, 
n i/ 2 —K—w £-1 j s required to diverge as n —> 00 for sure independence screen¬ 
ing. Furthermore, we note that the requirement for the bandwidth used in 
Proposition 1 is mild, which can be naturally satisfied by the conventional 
optimal bandwidth h = 0(n -1 / 5 ) selected by the cross-validation method. 

Let L = n 1 / 2 ~ K “' u, ^ r for some t£(0,|-k-w), A more clear uniform 
result related to the probabilistic behavior of the statistics £j( 0) for j E M.* 
is described in the following corollary. 

Corollary 1 . Under assumptions (A.1)-(A.6), pick we [^-,| — k), 
t e (0, | — k — w) and £ > k + 2 w, then 

max P{^(0) < \c\Kln 2r } < exp{-Cm (1/2 - K -^- r)7 } 

+ exp(-C Y m min{1 ” 2K -"'’ (1 - K -"' )/(1+5)} ), 
where C\ is given in Proposition 1. 
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Choosing threshold level 7 n = 7 c 2 K 2 n 2r in (2.7) and noting that 

P(Ad* £ Ad 7n ) = P{There exists j € Ad* such that £j( 0) < ^cfK 2 n 2r } 

< s max P{7j(0) < ^ c\K^n 2r }, 

we establish the sure screening property of our approach in the following 
theorem based on Corollary 1. 

Theorem 1. Under assumptions (A.1)-(A.6), pickw£ [2L,| —k), = 

}jc\K 2 n 2r for some r£(0,|-fi;-m), and £ > k + 2w, then 

P(Ad* C M ln ) > 1 - sexp{-Cm( 1 / 2 - K - u, - r ^} 

- s exp(-Cin min ^ 1 - 2 ' t - u, ’( 1 -' t -^/( 1+5) >), 
where C\ is given in Proposition 1. 

Theorem 1 implies that our local independence feature screening method 
can handle nonpolynomial dimensionality: logp = o[n e ) for e = min{l — 2 k — 
w,(^ — k — w — t) 7 }. By noting that w > the highest dimensionality is 
achieved with the optimal e = min{l — 2 k — ^j-,(^ — k — ^ 7 ) 7 } when r is 
close enough to zero. It actually depends on k , q\ and 7 , that is, the signal 
strength, smoothness of fj's and the tail probabilistic behavior of Y. If Y fol¬ 
lows a normal or sub-Gaussian distribution such that 7 = 2, the correspond¬ 
ing highest dimensionality satisfies logp = o(n 1 ~ 2ft ” 2K/,ei ). Furthermore, if 
the projections fj's have derivatives of all orders such that £1 = r = 00 , then 
the highest dimensionality satisfies logp = o(n 1 ~ 2ft ). 

In what follows, we consider the size of the selected set Ad 7n under an 
ideal case that 

(2.9) max \\fj\\oo = o(n~ K ). 

j<£M* 

The key is to investigate the probabilistic behavior of PjTj(O) > | c 2 K 2 n 2r } 
for each j Ad* which is given in the next proposition. 

Proposition 2. Under assumptions (A.1)-(A.2) and (A.4)-(A.6), sup¬ 
pose ma x j( £ Mt \\fj\\oo = 0(n~ v ) for some 7 > (2g ^ 1 1)K • Pick w <E 
min{i — k , 2(7 — k )}) ; t£ (max{ I — p — — k —w) and £ > k + 2w. If 

inf Mg [a, 6 ] ^{Y 2 \Xj = u) > p for some positive p holds for any j ^ Ad*, then 
there exists a uniform positive constant C 2 such that for any j ^ Ad*, 

P{^(0) > \c\Kln 2T } < exp(-C , 2 n min{?77 ’ (1 “ ,i,)7/e2 ’ 2r ’ 7(1 -“' ) / 6} ). 
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From Proposition 2, we can find that the quantities on the right-hand side 
are decreasing as w is increasing. Thus, the optimal w = is the same as the 
one for the best dimensionality p discussed previously. Hence, the optimal 
bandwidth in our screening procedure is w : -= —, which is quite sensible 
because intuitively the smoother each fj is, the larger the bandwidth is 
allowed. The corresponding upper bound for P{T,(0) > ^c 2 K^n 2r } is given 
in the following corollary. 

Corollary 2. Under assumptions (A.1)-(A.2) and (A.4)-(A.6), sup¬ 
pose ma||/,-||oo = 0 (n~ v ) for some rj > • Pick w = r € 

(max{i - V ~ \ - (gl ^ 1)K ) and £ > (gl ^ 2)K • 7 / inf Me[a,6] E (^ 2 |^,- = 

u) > p for some positive p holds for any j ^ Ad*, then there exists a uniform 
positive constant C 3 such that for any j ^ Ad*, 

P{^(0) > \c\Kln 2T } < pexp(-C 3 n min{?77 ’ (ei - K)7 / (ei ^ ) ’ 2r ’ (ei - K)7 / (6ei)} ). 


By noting that 



we have P(|Ad 7n | > s) < E {-//(0) — |cf/\|n 2T }. Hence, from Corol¬ 

lary 2 , we obtain the following theorem for the size of Ad 7 „. 

Theorem 2. Under assumptions (A.1)-(A.2) and (A.4)-(A.6), sup¬ 
pose ma WfjWoo = OirT 71 ) for some 77 > (2g *+ 1)K , Pi c k w = 2 L, £ > 
(gl ^ 2)K and 7 n = \c\K$n 2r for some r G (max{± - V ~ 2 ^ 7 ,0}, \ - (gl + 1)K ). 
//inf ue[a>6] E(F 2 |X j = u)> p for some positive p holds for any j ^ Ad*, then 

P(|Ad 7n | >s)< pexp(-C , 3 n min{ ' ?7 ’ (f,1 - K)7/(eie2) ’ 2T ’ (ei - K)7/(6ei) }) ) 
where C 3 is given in Corollary 2. 

This theorem shows that our screening procedure well controls the set size 
of the recruited variables. With large probability, the number of the recruited 
variables is not larger than the true size s. From Theorems 1 and 2 with 
w = we have that P(Ad 7n = Ad*) -A 1 as n —» 00 provided that logp = 
0 ( n mi n{»? 7 ,(e i -«;) 7 /(eie 2 ), 2 r,(ei-re) 7 /( 6 pi),i- 2 K-K/£ii,(i/ 2 -K-K/ei-r) 7 }^ xhis selec¬ 
tion consistency property demonstrates that our approach performs very 
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well by distinguishing the true contributing variables from false ones under 
condition (2.9). Aiming to obtain the optimal diverging rate for p, we can 
select 


T = 


7/I K 

-- K - 

7 + 2 V 2 ei 


2f?i 


+ + 


■r 1 (7 — 2)/c 

if tj i>-A-+ — --— ; 

7 + 2 7 + 2 2(7 + 2)q\ 

1 7 « ( 7-2 )k 

'“7 + 2 7 + 2 2(7 + 2)q\ 


where ? can be chosen to be positive and converging to 0 as n -A 00 . Hence, 
P(M ln = -Ad*) -A 1 as n -A 00 provided that 


logp = < 


0 ( n min ( (01 -k) 7/(0102),7(1—2«-2 re/£>l)/(7+2), (01 -k)7/(60i )} ) , 

if „>l±^ + . 

7 + 2 2(7 + 2)£»i 

o(?T. min {(^ 1—K )7/0 2 ),(0i — + 7 /(20i))7} ) ^ 

• r / !+ 7 « , (7 ~ 2)k 
7 + 2 2(7 + 2)q\ 


More specifically, if the response variable Y has a compact support which 
means 7 = 00 , the above selection consistency holds if log p = o(n 1 ~ 2K_2K / ?1 ). 
Additionally, the smoothness of the projections /+s also affects the allowable 
dimensionality. When all fj G C'°°(A’) implying that £1 = r = 00 , the allow¬ 
able dimensionality turns out to be logp = o(n 1 ~ 2n ). If Y follows a normal or 
sub-Gaussian distribution that 7 = 2 and 77 = 00 which can be guaranteed by 
partial orthogonal condition [Huang, Horowitz and Wei (2010)], the selection 
consistency holds if logp = o(n mm { 1 / 2_K_K//01 ’ 1 / 3 ~ K /( 3 £ 1 )}). It is worthwhile 
to note that though we show that our approach can identify the set of con¬ 
tributing variables with probability tending to 1, practical performances can 
vary because first the results are valid asymptotically, and second choosing 
the threshold level 7 n to achieve the perfect variable selection is difficult. 


3. Applications to some special models. Our local independence feature 
screening approach does not require a specific form of the underlying model. 
Now we elaborate how the proposed approach can be applied in three fami¬ 
lies of popular nonparametric and semiparametric models: the nonparamet- 
ric additive models, the single-index models and multiple-index models, and 
varying coefficient models; and we also compare our results with existing 
ones. 


3.1. Nonparametric additive models. The nonparametric additive model 
introduced by Stone (1985) has the form Y = J2j=i s j(^j) + s, where 
si(-), ..., Sp(-) are unknown functions with zero mean and E(e|X) = 0. It 
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is a special case of model (2.1) with m(X) = Y^ P j=i s j(-^j)- For this model, 
the true model can be defined as M.* = {1 < j < p : E{s 2 (Xj)} > 0}. Recall 
that fj{x ) = E,(Y\Xj = x) following the earlier discussion in Section 2. If we 
identify the true model by 

(3.1) min ll/j lloo > cin _re 

jeM* 

for some nonnegative k, our local independence feature screening procedure 
proposed in Section 2 can be directly applied here to surely identify the 
contributing explanatory variables. 

Let us carefully compare our approach with the one in Fan, Feng and Song 
(2011) that specifically targets at the feature screening problem for non- 
parametric additive models. The screening procedure of Fan, Feng and Song 
(2011) includes four steps. First, each fj is expanded by using the B-spline 
basis functions { ^j,k}kL\ and use the truncated version f n j = J 2 k=i Pj,k^j,k 
to approximate the projection fj. Second, to estimate the coefficients , 8 j,k s 
and obtain the corresponding estimation of each projection fj. Third, to esti¬ 
mate each K{fj(Xj)} by n _1 f 2 j(Xij) where f nj is the estimation of fj 
obtained in the second step. Fourth, to screen features via the corresponding 
magnitudes of these estimates. To ensure the sure screening property, they 
assume that each fj belongs to the class of functions whose rth derivative 
satisfies the Lipschitz continuity of order 6 £ (0,1] and d = r + 6 > 0.5, where 
r is a nonnegative integer, and identify the true model by the condition 

(3.2) min E{/ 2 ( W)} > Cd n n~ 2k 

jeMt J 

for some positive constant C. Here, 0 < R < and d n is the truncation 
parameter used in approximating each fj which satisfies d n > Cn 2K, ^ 2d+l ^ 
for some positive constant C. By Theorem 1 of Fan, Feng and Song (2011), 
the sure screening property holds for their procedure if logp = o(n 1 _ 4 K d“ 3 ). 
When d n = 0(n 2K ^ 2d+1 ' > ), (3.2) implies minj^v^ E {f 2 (Xj)} > (JrX AdK 8 - 2d + l ) 
for some positive constant C. Actually, if the density of each Xj is uniformly 
bounded away from zero, these conditions are sufficient for the identification 
condition of our approach given in (3.1) to hold with k = Fan, Feng 

and Song (2011) also assume that the error e satisfies E{exp(B|e|)|X} < C 
for some positive constants B and C which implies there exist two positive 
constants b\ and 62 such that P(|e| >u)< 6 iexp(— 62 ^) for any u > 0. See 
Lemma 2.2 in Petrov (1995). On the other hand, they also assume that 
||m||oo B for some positive constant B. These two conditions together im¬ 
ply that 7 = 1 in our assumption (A.5). In this case, the sure screening prop¬ 
erty of our approach given in Theorem 1 holds if logp = o(n 1 / 2-K-K / pi ) = 
o(n 1 / 2_2dK ( 1+1 / ei )/( 2d+1 )). When d + \> (10 + 6 d — |^)k, their procedure 
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can handle faster diverging p than that of ours; otherwise, our method is 
stronger than theirs. This shows that our procedure can handle faster di¬ 
verging p when the signal strength level is weak, that is, k is large. 

In the specific case when the number of basis functions d n = 0(n 1 ^ 2d+l ' > ) 
which leads to the optimal rate for the B-spline approach [Stone (1985)], 
their approach can handle the dimensionality logp = o(n 1_4K ~ 3 ^ 2rf+1 ^). For 
such a selection of d n , the allowable dimensionality of our approach under 
which the sure screening property holds is logp = o(n 1 / 2+ ( 1+1//£ ’ 1 ^ 1// ( 4rf+2 ^ K ^). 
To make the approach of Fan, Feng and Song (2011) work with a high- 
dimensional setting for such a selection of d n , the smooth parameter d should 
be larger than 1 implying the existence of the first derivation of each fj, 
which is not required in our approach. From this point of view, our approach 
can handle the situation where each nonparametric marginal projection fj 
does not have the first derivative but being just continuous. When d > 1, 
the parameter r in (A.l) and (A.2) satisfies 1 <r < d<r + 1. If the mini¬ 
mum signal does not diminish to 0 [Lin and Zhang (2006), Huang, Horowitz 
and Wei (2010)], then k = 0. In this case, our approach and their approach 
can handle the nonpolynomial dimensionality logp = o(n 1 / 2+ ^ 1+1 / r ^^ 4a!+2 ^) 
and logp = o(n 1_3 ^ 2(i+1 )), respectively. If 2d<6 + r _1 , our approach al¬ 
lows faster diverging p than that of their approach; otherwise, their result 
is stronger than ours. This can be viewed as a price paid for our approach 
by allowing weaker requirement on the continuity of each fj and without 
requiring m(X) to be bounded. 

We note that the above diverging rates comparison is established in a case 
in favor of the approach of Fan, Feng and Song (2011) by using their identi¬ 
fication condition with the smallest d n . If d n diverges faster than n 2K /( 2rf + 1 ) ; 
the parameter k appeared in our identification condition will be smaller 
than ^d+i an d the allowable dimensionality of our approach will be im¬ 
proved. Additionally, our approach has a very good control of the size of 
the recruited variables. From Theorem 2, the set of the recruited variables is 
not larger than the true contributing covariates with large probability, which 
together with Theorem 1, imply the selection consistency of our approach 
in nonparametric additive models. 

3.2. Single-index and multiple-index models. The single-index model that 
is recognized as a particularly useful variation of the linear regression model 
has the form Y = s(/3 T X) + e, where s(-) is the conditional mean func¬ 
tion that is not explicitly specified; see Brillinger (1983) for more details. 
This kind of models is a special case of model (2.1) with m(X) = s(/3 T X). 
Since single-index model requires identifiability condition [Brillinger (1983)], 
parameters in marginal single-index models is not identifiable. Therefore, 
marginal estimator-based ranking procedure cannot be applied to handle 
the feature screening problem in the single-index model. 



14 


J. CHANG, C. Y. TANG AND Y. WU 


In this case, our local independence feature screening approach conve¬ 
niently applies, because our marginal empirical likelihood based approach 
does not require estimating parameters in the marginal models. Since our 
marginal empirical likelihood approach is capable of assessing K(Y\Xj) = 0, 
identifying the parameter in a marginal single-index model is not necessary 
for our local independence feature screening approach. Specifically, if we 
identify the true model by minll/jlloo > cin _K , then the local indepen¬ 
dence feature screening procedure and its properties discussed in Section 2 
also directly apply here. The effective application of our approach in single¬ 
index models demonstrates that the marginal empirical likelihood based 
approach is advantageous in independence feature screening. Such a merit 
is due to the new insight of the marginal empirical likelihood approach for 
screening variables by assessing the evidence against the null hypothesis that 
the explanatory variable is not contributing marginally. 

More generally, we may consider the screening for multiple-index model 
Y = s (/ djx ,. .. ,/3^-X) + e, where K is a known integer, j 3 k (k = 1 ,..., K ) 
are sparse vectors of unknown parameters, and s is an unknown function. 
The marginal condition for the jth component of X given by (2.2) still ap¬ 
plies in the multiple-index models. Nevertheless, in multiple-index models, 
identification is also an issue, which is actually more complicated and chal¬ 
lenging than that in the single-index models when considering the marginal 
contributions. Since our concern in this model can also be transformed to 
assessing E(T|Xj) = 0 or not, the local independence feature screening pro¬ 
cedure given in the last section can also be applied. 

3.3. Varying coefficient models. Varying coefficient model is useful for 
studying the variable-dependent effects in the regressions. Many methods 
have been proposed for estimation of this model. See, for example, Fan and 
Zhang (2000) for the local polynomial smoothing method, Huang, Wu and 
Zhou (2002) and Qu and Li (2006) for basis expansion and spline method. 
The varying coefficient model has the following form: 

(3.3) Y = X t (3(Z)+£, 

where X = (X\ ,...,X p ) T is a p x 1 vector of explanatory variables, Z is 
a scalar variable that takes values on a compact interval Z, and e is the 
error satisfies E(e|X,Z) = 0 almost surely. Here, (3{z) = ... ,(3 p (z)) T 

is unknown but smooth in z. One way to screen explanatory variables could 
be ignoring the impact due to Z, and applying some marginal approaches. 
However, it is not difficult to see that there are situations, for example, when 
f j3{z)dz = 0, that is, the parameter effect is zero in an average, univariate 
screening procedure ignoring Z will not be able to identify the component 
in X even if it is contributing. 
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To overcome such a difficulty, we may adjust the marginal nonparametric 
regression for varying coefficient models to incorporate the effect of Z. The 
marginal version of (3.3) is 


(3.4) Y = a j (Z) + X j b j (Z) + e j , 

where E (ij\Xj,Z) = 0 almost surely for j = 1 Here, bj(Z) can be 

interpreted as the marginal contribution of Xj in explaining Y via Z. It can 
be shown that bj(Z ) = c °^ • Thus, the marginal effect bj(z) = 0 for any 
z is equivalent to cov(Xj,Y\Z = z) = 0 for any z. In the simple case when 
E(Y\Z) = 0 or some constant free of Z, it is equivalent to cov(Xj,Y\Z) = 
E{XjY\Z ) = 0, which essentially share the same form E(y|X,-) = 0 as in the 
local independence featuring screening. In the general situation, E(Y\Z) ^ 
0 so that one needs to assess 0 = cov(Xj,Y\Z) = E[Xj{Y — E{Y\Z)}\Z] 
with the nuisance function E(Y\Z) estimated. For a kernel function /C(-), 
E(y|Z = z) can be estimated by 


(3.5) 


E(Y\Z = z) 




where /C^(u) = h 1 IC(uh x ) with bandwidth h. Let Y) = Y) — E{Y\Z 
then the marginal empirical likelihood is constructed as 


Zi), 


(3.6) EL j (z, 0) = sup 


n 

~\_w i \Wi>0,'^2w i = l,y ^WjKhjZj - z)XijYi = 0 


»=l 


i= 1 


i —1 


We then propose using 


(3.7) £j( 0) = sup £j(z, 0), 

zez n 

where £j(z, 0) = — 21og{ELj(z, 0)} — 2nlogn, and Z n is a partition of Z. 
In practice, a natural choice is to evaluate the statistic (3.7) by £j( 0) = 
maxi<j< n ^j(Zj,0) where {Zj }” =1 are the n observations of variable Z. This 
new £j{ 0 ) can be used for local independence feature screening for varying 
coefficient models analogously to that in Section 2. 

The analogous assumptions corresponding to (A.1)-(A.5) in Section 2 are 
given as follows. 

(A.I) 7 For each j = 1,... ,p, let fj(z) = cov(Xj,Y\Z = z). Assume {/, }j = j 
belong to C r (Z). If r = 0, we assume that /j-’s satisfy the Lipschitz condition 
with order a € (0,1], that is, \fj(s) — fj(t)\ < K\\s — t\ a for any s,t G Z, 
where K\ is a positive constant uniformly for any j = 1,. .. , p . In addition, 
there exists a constant K 2 such that | /-^ (z) | < K 2 for any z G Z and j = 
l,...,p. 
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(A.2)' The density function g of Z satisfies 0 < K% < g(z ) < K 4 < 00 on 
Z. In addition, we assume that g belongs to C r (Z ) for the r given in (A.l)' 
and |g( r )(z)| < K§ for any z £ Z. 

(A.3)' There exist nonnegative constants c\ > 0 and k £ [0, 

such that min^g^,, ||/j||oo > cin _K , where r and a are specified in (A.l)'. 
(A.4)' There exists some positive constant ^ such that \\Z n \\ = n x 
(A.5)' There exist positive constants Kq,Ky,'Ji and 72 such that P(|T| > 
u) < Kq exp(— K-jvZ 1 ) and P(|AC,-1 >u)< K§exp(—Kju 12 ) for any u > 0 and 
3 = h---,P- 

Assumption (A.5)' and Lemma 2 in Chang, Tang and Wu (2013b) yield 
that P(|AjT| >u)< 2 Kq exp(— K 7 U 1 ) for any u > 0 and j = 1,... ,p, where 
7 = . To investigate the theoretical properties of (3.7) with nonpara- 

metric estimation (3.5), we need the following two extra conditions: 

(A.7) E(Y\Z = z) belongs to C r (Z), where r is given in (A.l)'. If r = 0, 
we assume E(y|Z = z) satisfies the Lipschitz condition with order a where 
a is specified in (A.l)'. 

(A. 8 ) For r specified in (A.l)', if r > 1 , the kernel function /C(-) is of order 
r. If r = 0, the kernel function satisfies IC(u) > 0 for any u and f IC(u) du = 1. 

For 7 = 77 ^, we define Q 2 and 5 as (2.8). In addition, let <7 l = max{^ — 

1,0}. Meanwhile, we assume that the bandwidths h and h in constructing 
marginal empirical likelihood (3.6) and the NW estimator (3.5) for E(T|Z = 
z) satisfy h x n~ w and h x n ~^, where w and f> will be specified later. The 
property of the marginal empirical likelihood for varying coefficient models 
are given in the following theorem. 


Theorem 3. Under assumptions (A.1)'-(A.5)', (A. 6 ) and (A.7)-(A.8), 
pick w £ [2*-, | — k), 4> £ 1 — 2k), £ > k + 2 w, and = \c 2 K 2 n 2r for 

some r £ ( 0 , 7 — k — w), then there exists a uniform positive constant C 4 
such that 


P(A4* C M 7 J > 1 - sexp{-C 4 n (1/2 - K - w - T)7 } - sexp{-C 4 n^-^ 2 } 

/ ^ minll—2 k — w , } \ 

— sexp(—G471 1 ’ !+* { ) 

r {3-2(<p + K. + n-+r)}> ; , {2- ,p- 2(k. + w+ t )} , 

— sexp( — C 4 U *■ (2+2« 1 )7 2 +2 ’ 72+2 ') 

r (1-2 k-0)72 (1-k-<^)72 1 

— sexp(— C 4 n ^ 7 2+ 2 , (i+^i) 72 + lj '). 


Theorem 3 provides a general result for the sure screening property of 
the marginal empirical likelihood for varying coefficient models. The first 
two terms and the fourth term on the right-hand side of above inequality 
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are the same as those in Theorem 1. The extra terms are due to the kernel 
estimation of E(Y|Z). The analogues of Proposition 1 and Theorem 2 are 
also valid for varying coefficient models using the above marginal empirical 
likelihood. 

Fan, Ma and Dai (2014), Liu, Li and Wu (2014) and Song, Yi and Zou 
(2014) also consider the feature screening for ultra-high dimensional varying 
coefficient models. Fan, Ma and Dai (2014) and Liu, Li and Wu (2014) con¬ 
sidered the same model as (3.3) while Song, Yi and Zou (2014) allows both 
Y and X to depend on Z. Fan, Ma and Dai (2014) estimate aj (•) and bj(-) 
in (3.4) simultaneously via the B-spline basis functions expansion approach. 

Fan, Ma and Dai (2014) propose an estimator Uj for E{ co ^J(x’jz)^ } f° r 
each j = 1,... ,p measuring the marginal contribution of Xj. Then they pro¬ 
pose to select the covariates via ranking |{tj|’s. Liu, Li and Wu (2014) study 
conditional Pearson correlation as a measure of marginal contribution for 
varying coefficient models. For each j = 1,... ,p and z, they estimate condi¬ 
tional Pearson correlation p(Xj,Y\Z = z) via kernel smoothing method and 
construct an estimation p* for E{p 2 (Yj, Y\Z)}. Then they use the magni¬ 
tude of \p*\ to determine whether Xj is an important explanatory variable 
or not. The main idea of Song, Yi and Zou (2014) is similar to that of Fan, 
Ma and Dai (2014). In the sequel, we carefully compare our procedure with 
those proposed in Fan, Ma and Dai (2014) and Liu, Li and Wu (2014). 

Fan, Ma and Dai (2014) identify A4* via minjgjvi, E{cov 2 (Xj, Y\Z)} > 
Cd n rT 2K for some positive constants C and k < ■ Here, d n is the 

number of approximation terms to a j (■) and bj (■) in the estimation step. 
Details of d n can be found in Section 3.1 for discussion of additive mod¬ 
els. To guarantee the validity of the approach in Fan, Ma and Dai (2014), 
d n > Cn 2 */( 2d+1 ) is required for some positive constant C and d = r + 6 , 
where r is an integer and 0 € (0,1] are employed to describe the smoothness 
of each a 3 and bj, that is, the rth derivatives of all aj and bj are Lipschitz 
continuous of order 9. In Liu, Li and Wu (2014), the identification condition 
is given by min j^M* E{p 2 (Xj, Y\Z)} > Cn~ 2K for some positive constants C 
and R <h. Here, h<\ is a parameter employed to describe the decay rate of 
the bandwidth for the kernel smoothing step in their procedure, that is, the 
bandwidth is chosen as 0{n~ n ). Based on the moments condition and the as¬ 
sumptions inf Z £z mini<j<pvar(Xj|Z = z) > 0 and inf~ e ^ var(Y|Z = z) > 0, 
the identification condition in Liu, Li and Wu (2014) is essentially equivalent 
to minj g ^„ E{cov 2 (Yj, Y\Z)j > Cn~ 2K for some positive constant C. 

From three aspects, we compare our identification condition and theo¬ 
retical results with those of Fan, Ma and Dai (2014) and Liu, Li and Wu 
(2014). First, if the density of Z is uniformly bounded away from zero and 
infinity on its support Z, their Z/ 2 -type requirement is a sufficient condi¬ 
tion for ours proposed in assumption (A.3)L But their identification condi¬ 
tions rule out the case where cov(Yj, Y\Z = z ) only contribute largely at 
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several local small intervals on Z. Second, our method works for weaker 
signal strength than theirs. Fan, Ma and Dai (2014) can handle the case 
that [K{cov 2 (X j ,Y\Z)}} 1 / 2 >Cn~ dk ^ 2d+ ^ for some k < ■ Therefore, 

the weakest signal strength can be handled by their method cannot decay 
at a rate faster than O(n~ d ^ 8d+10 ' > ). On the other hand, the weakest signal 
strength our method can handle is at the rate of 0(n~ ei ^ 2gi+2 ^). By noting 
that r < d<r + 1, we have 2 > 8rf + 10 which implies our method can ac¬ 

commodate weaker signal strength than Fan, Ma and Dai (2014). If condition 
(C4) of Liu, Li and Wu (2014) holds, the parameter r in our notation system 
is not smaller than 2. Thus, 2^+2 — 3 which implies our method can also 
accommodate weaker signal strength than that of Liu, Li and Wu (2014). 
Third, we compare the nonpolynomial dimensionality allowed by different 
methods. As our theoretical assumptions are close to that proposed in Liu, 
Li and Wu (2014), we compare our result with theirs. When 71 = 72 = r = 2, 
which are assumed in Liu, Li and Wu (2014), from Corollary 3, our method 
can handle logp = 0 ( n mm { 4 < A- 2 M/^-K-</V 2 }) w h en T i s chosen to be close 
enough to zero and w = where the result for the method of Liu, Li 

and Wu (2014) is logp = o{n h ~ K ) for some h < ^. Notice that k < h < | in 
their setting. If we choose (j> = A, then min(40— 2k, ^ — k— ^) = | — k> H— k , 
which means our procedure can accommodate faster diverging p than theirs. 

4. Iterative feature screening. It is possible that some predictors are 
marginally unrelated but jointly related to the response as illustrated by 
Example 4.2.2 of Fan and Lv (2008). As observed in the literature, marginal 
utility-based feature screening methods may miss this type of predictors 
badly [Fan and Lv (2008), Fan, Samworth and Wu (2009)]. To overcome this 
difficulty, several versions of iterative feature screening have been proposed. 
Fan and Lv (2008) proposed to regress the response on the recruited predic¬ 
tors and use the regression residual as the new “response” to recruit further 
from the remaining predictors. While recruiting additional predictors, Fan, 
Samworth and Wu (2009) consider the joint model with both the recruited 
predictors and each additional predictor and use the conditional contribution 
of each additional predictor given those predictors that are already selected 
to recruit further from the remaining predictors. Both versions need to fit a 
joint model on the recruited predictors with or without an additional feature 
where a parametric model specification is required. However our proposed 
empirical likelihood-based screening is for a general nonparametric model 
(2.1) and, in this sense, model-free. Consequently, the aforementioned two 
versions of iterative feature screening cannot be extended to our case. Next, 
we will borrow the idea of Zhu et al. (2011) and propose an iterative version 
for our proposed empirical likelihood-based screening. 

Next, we detail our iterative screening procedure. 
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Step 1. We first apply our proposed empirical likelihood based screening 
to {(Xj, Yi), i = 1 ,..., n} and denote the selected set of predictors by A\. 

Step 2. Apply COSSO [Lin and Zhang (2006)] to data {(X ? .^, Y t ).i = 

1 ,n} and use M.\ to denote the subset of A\ that are retained by the 
COSSO. _ 

Step 3. For any j (j Adi, we regress Xj on the predictors with indices in 
based on the data {(X^^, X %3 ), i = and denote the residual 

by iij, i = 1,... ,n. For any j treat £ t j, i = 1,... ,n as the pseudo 

predictor and apply our proposed empirical likelihood-based screening to 
recruit a subset A 2 of predictors. 

Step 4. Apply COSSO to data {(X^ ^ ,>*),* = 1,... ,n} and use Xi 2 

to denote the subset of XA.\ U A 2 that are retained by the COSSO. 

Step 5. Repeat Steps 3 and 4 until Xi^ = M.k -1 or the number of predic¬ 
tors in Xik reaches some prespecihed number. 

5. Numerical examples. We next demonstrate the performance of the 
local independence feature screening methods using hve examples with com¬ 
parisons to appropriate existing alternative approaches. In what follows, we 
denote our method by EL when reporting the results. When implement¬ 
ing the local independence feature screening approach, we select the band¬ 
width h by the cross-validation method [Fan and Gijbels (1996)], and the 
Epanechnikov kernel K(u) = |(1 — u 2 )I(\u\ < 1) is applied for the marginal 
regressions. 

Example 1. This example is taken from Example 3 of Fan, Feng and 
Song (2011). Data are generated from model Y = 5gi(X\) + 3 g 2 (X 2 ) + 
453 (^ 3 ) + 654 (Xi) + < 7 £ with gi(x)=x, g 2 (x) = (2x- l) 2 , g 3 (x) = 
and g±( x ) = 0.1sin(27rx) + O.2cos(2-7rx) + 0.3sin 2 (27rx) + 0.4cos 3 (27rx) + 
0.5sin 3 (27rx). Here, predictors Xj ’s are i.i.d. random variables of Unif(0,1) 
distribution, and e ~ N(0, 1) is independent of Xj' s. We set p = 1000. Data 
sets of size n = 400 are used. We apply the proposed screening method to 
reduce the number of predictors from 1000 to 20. We consider different sig¬ 
nal noise ratios by varying a 2 at four different levels while Fan, Feng and 
Song (2011) chose a specific value of a 2 = 1.74. Table 1 reports the frequency 
of the important predictors being selected among 100 repetitions for differ¬ 
ent values of a 2 . A comparison is made with Fan, Feng and Song (2011). 
It is observed that the proposed empirical likelihood-based local indepen¬ 
dence feature screening performs similarly as the method proposed by Fan, 
Feng and Song (2011). Contributing features X 2 and X 4 are selected by 
both methods for all repetitions. Yet for feature X 2 with a slightly weaker 
contribution, our method does slightly better. 
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Table 1 

Simulation results for Example 1 


a 2 Method Xi X 2 X 3 X 4 


1 EL 

Fan, Feng and Song (2011) 
Zhu et al. (2011) 

Li, Zhong and Zhu (2012) 
Chang, Tang and Wu (2013a) 
1.74 EL 

Fan, Feng and Song (2011) 
Zhu et al. (2011) 

Li, Zhong and Zhu (2012) 
Chang, Tang and Wu (2013a) 

2 EL 

Fan, Feng and Song (2011) 
Zhu et al. (2011) 

Li, Zhong and Zhu (2012) 
Chang, Tang and Wu (2013a) 

3 EL 

Fan, Feng and Song (2011) 
Zhu et al. (2011) 

Li, Zhong and Zhu (2012) 
Chang, Tang and Wu (2013a) 


100 

97 

100 

100 

100 

90 

100 

100 

100 

1 

100 

100 

100 

27 

100 

100 

100 

1 

100 

100 

100 

96 

100 

100 

100 

88 

100 

100 

100 

3 

100 

100 

100 

24 

100 

100 

100 

1 

100 

100 

100 

95 

100 

100 

100 

87 

100 

100 

100 

2 

100 

100 

100 

24 

100 

100 

100 

1 

100 

100 

100 

93 

100 

100 

100 

85 

100 

100 

100 

1 

100 

100 

100 

20 

100 

100 

100 

1 

100 

100 


Example 2. This is a nonlinear predictor effect example with heteroge¬ 
neous conditional variance. Data are generated from model Y = —3hi{X\) + 
2.5/is (X 2 ) — ^hs(X^) + I.5/14 (-X 4 ) + ere with h\(x) = (2x — l) 2 , h 2 (x) = 


2+sin(2^) > Mz) = 2 !^) , and h 4 (x) = cos{( 2 x - 1 )tt}. 
independent and uniform over [ 0 , 1 ], e is independent of Xj 7 s and has nor¬ 
mal distribution with mean zero and its heterogeneous conditional variance 
is generated by var(e) = - T - :i . The noise level is govern by a with 

different values in the simulations. We set p = 1000 and n = 100. We ap¬ 
ply the proposed screening method to reduce the number of predictors from 
1000 to 20. For comparison purposes, we also apply the methods on data 
generated from a model with homogenous conditional variance while other 
settings are the same. The results are reported in Table 2 for the two cases. 
There are a few observations. First, those methods incorporating less lo¬ 
cal impact perform poorly in this nonlinear effect example, demonstrating 
the importance and substantial effect of the local feature of the marginal 
contributions to the response variable. Second, in this example with hetero¬ 
geneous conditional variance, our method outperforms others, and is better 
than the one of Fan, Feng and Song (2011), especially when the noise level 


cos(27nr) 


Here, Xj 7 s are 
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Table 2 

Simulation results for Example 2 with a 2 controlling the overall noise level 


<T 2 

Method 


Homogeneous 

variance 


Heterogeneous 

variance 

ATi 

x 2 

As 

x 4 

Xi 

x 2 

x 3 

x 4 

0.5 

EL 

99 

97 

97 

100 

83 

90 

81 

94 


Fan, Feng and Song (2011) 

94 

99 

90 

100 

76 

86 

78 

92 


Zhu et al. (2011) 

3 

4 

4 

4 

4 

6 

2 

2 


Li, Zhong and Zhu (2012) 

36 

59 

38 

68 

23 

43 

22 

47 


Chang, Tang and Wu (2013a) 

3 

4 

1 

2 

4 

6 

0 

0 

1.0 

EL 

94 

98 

89 

99 

66 

74 

67 

85 


Fan, Feng and Song (2011) 

88 

95 

86 

97 

53 

67 

58 

84 


Zhu et al. (2011) 

3 

4 

2 

3 

4 

4 

0 

4 


Li, Zhong and Zhu (2012) 

27 

50 

32 

60 

16 

34 

16 

37 


Chang, Tang and Wu (2013a) 

3 

5 

1 

2 

3 

5 

0 

1 

1.5 

EL 

88 

95 

87 

97 

57 

61 

48 

73 


Fan, Feng and Song (2011) 

81 

92 

81 

96 

43 

55 

41 

76 


Zhu et al. (2011) 

3 

6 

2 

2 

4 

4 

1 

2 


Li, Zhong and Zhu (2012) 

22 

44 

26 

51 

14 

24 

13 

26 


Chang, Tang and Wu (2013a) 

3 

5 

0 

2 

3 

5 

0 

1 

2.0 

EL 

84 

86 

82 

94 

46 

52 

38 

65 


Fan, Feng and Song (2011) 

77 

87 

79 

93 

34 

50 

35 

61 


Zhu et al. (2011) 

4 

5 

0 

2 

4 

5 

1 

2 


Li, Zhong and Zhu (2012) 

21 

39 

21 

47 

14 

20 

11 

20 


Chang, Tang and Wu (2013a) 

3 

5 

0 

1 

5 

4 

0 

2 


is relatively higher implying the signal is relatively weaker. This is consis¬ 
tent with our theory and the finding from Example 1, and it shows that 
our method is advantageous for detecting nonlinear effects. It also demon¬ 
strates that when the signal is weak, and when the situation is more difficult 
due to high level and more complex variations, our method delivers more 
promising results thanks to the feature of the marginal empirical likelihood 
approach. 

Example 3. Data are generated from model Y = f3\Xi + P 2 X 2 + P 3 X 3 + 
P 4 X 4 +E with independent error e ~ IV(0,1). Jointly Gaussian covariates sat¬ 
isfy E (Xj) = 0 and var (Xj) = 1 for j = 1 ,p with cov(Yj, V 4 ) = -^= for 

j 7 ^ 4 and cov (Xj ,Xji ) = \ if j and j' are distinct elements of {1,..., p} \ {4}. 
True regression coefficients are given by j3\ = P 2 = P 3 = 2, @4 = — 3\/2, and 
f3j = 0 for j > 4 such that X 4 is marginally independent of the response Y. 
Yet X 4 is the most important predictor variable in the joint model. This 
example is to illustrate that the iterative version of the proposed screening 
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Table 3 

Simulation results for Example 3 


(n,p) 

Iterative screening method 

Xi 

x 2 

x 3 

x 4 

Average for 
Xj, j> 5 

(300, 1000) 

EL 

100 

100 

100 

100 

0.0351 


Fan, Feng and Song (2011) 

100 

100 

100 

100 

0.1175 


Zhu et al. (2011) 

100 

100 

100 

100 

N/A 


Li, Zhong and Zhu (2012) 

100 

100 

100 

100 

N/A 


Chang, Tang and Wu (2013a) 

100 

100 

100 

99 

0.0281 

(400, 2000) 

EL 

100 

100 

100 

100 

0.0141 


Fan, Feng and Song (2011) 

100 

100 

100 

100 

0.0612 


Zhu et al. (2011) 

100 

100 

100 

100 

N/A 


Li, Zhong and Zhu (2012) 

100 

100 

100 

100 

N/A 


Chang, Tang and Wu (2013a) 

100 

100 

100 

100 

0.0220 


procedure works effectively. We borrow the idea of Zhu et al. (2011) to define 
the iterative version of our screening procedure as laid out in Section 4. Fan, 
Feng and Song (2011) only considered the case with (n = 400, p = 1000) while 
we consider two cases: (n = 300, p = 1000) and (n = 400, p = 2000). Simula¬ 
tion results over 100 repetitions are reported in Table 3, where we report the 
frequency of important predictors being selected and the average frequency 
of unimportant predictors being selected. It shows that the iterative screen¬ 
ing based on empirical likelihood performs similarly as the nonparametric 
screening proposed in Fan, Feng and Song (2011). In terms of average fre¬ 
quency of unimportant predictors being selected, our new method has slight 
advantage. 

Example 4. In this example, we consider a single-index type model. 
Data are generate from Y = m(X) + as, where m(X) is generated from 
exp{ — |(X^/0.8 2 + X|/0.9 2 + X|/1.0 + Xj/1.1 2 )} by appropriately scaling 
it to have zero mean and unit variance, predictors are independently gen¬ 
erated from standard normal distribution and e ~ IV(0,1) is independent of 
Xj’s. We set p = 1000 and n = 100, and vary the noise level as 0.5 and 1.0, 
respectively. We apply the proposed screening method to reduce the number 
of predictors from 1000 to 20 and compare it with the method of Fan, Feng 
and Song (2011) and additional two methods in Zhu et al. (2011) and Li, 
Zhong and Zhu (2012). In this example, the signals are strongest locally at 0 
while decay exponentially fast at other locations, and X\ is the strongest and 
X\ is the weakest in their signal strength according to the coefficients. Note 
that the iterative screening of Fan, Feng and Song (2011) is residual-based 
while that of Zhu et al. (2011) is projection-based. Thus to be fair, we only 
compare in terms of the noniterative version. The frequencies of important 
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Table 4 

Simulation results for Example f 


(J 

Method 

X x 

y 2 

X 3 

x 4 

0.5 

EL 

96 

92 

81 

63 


Fan, Feng and Song (2011) 

92 

77 

62 

32 


Zhu et al. (2011) 

1 

1 

5 

3 


Li, Zhong and Zhu (2012) 

58 

29 

26 

9 


Chang, Tang and Wu (2013a) 

0 

1 

2 

5 

1.0 

EL 

81 

69 

60 

36 


Fan, Feng and Song (2011) 

73 

62 

45 

19 


Zhu et al. (2011) 

1 

0 

7 

4 


Li, Zhong and Zhu (2012) 

37 

12 

17 

7 


Chang, Tang and Wu (2013a) 

1 

1 

2 

4 


predictors being selected over 100 repetitions are reported in Table 4 for 
different methods. We see that our method performs much better than that 
of Fan, Feng and Song (2011) thanks to the merit of our method in detect¬ 
ing local contributions. Additionally, we see that correlation based methods 
completely fail in this case while the distance correlation based method of 
Li, Zhong and Zhu (2012) can still detect signal, while our method performs 
the best. 

Example 5. We consider the varying coefficient model in this exam¬ 
ple. We generate data from model Y = X\j3\(Z) + X^fo^Z ) + X%/3(Z) + 
X 4 (3(Z) + e, where predictors are multivariate normal with E(Xj) = 0, 
var(Aj) = 1, and zero correlation, e ~ 1V(0,0.1) is independent of Xj’s, and 
Z is independently generated from the standard uniform distribution over 
[0,1]. The varying coefficients are given by f5\(z) = sin(27rz + ^), fyfe) = 
sin(27rz), /3s(z) = cos( 27T2) and ^(z) = sin(27rz + ^). We fix the dimension¬ 
ality p = 1000 and vary the sample size from 100 to 200. We try to reduce 
the dimensionality from 1000 to 20 and compare our method to Fan, Ma and 
Dai (2014), Liu, Li and Wu (2014) and Song, Yi and Zou (2014) in terms 
of the noniterative version due to the same reason as in the above example. 
Table 5 summarizes the results over 100 repetitions in terms of how often 
important predictors are selected. It shows that our methods perform com¬ 
petitively. Note that the methods of Fan, Ma and Dai (2014), Liu, Li and 
Wu (2014), Song, Yi and Zou (2014) are developed specially for the varying 
coefficient model. 

6. Discussion. We have proposed and investigated a local independent 
feature screening method using the marginal empirical likelihood in con¬ 
junction with marginal kernel smoothing methods to detect contributing 
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Table 5 

Simulation results for Example 5 


n Method X 4 X 2 X 3 X 4 


100 EL 

Song, Yi and Zou (2014) 
Liu, Li and Wu (2014) 
Fan, Ma and Dai (2014) 

Fan, Feng and Song (2011) 
Zhu et al. (2011) 

Li, Zhong and Zhu (2012) 
Chang, Tang and Wu (2013a) 

200 EL 

Song, Yi and Zou (2014) 
Liu, Li and Wu (2014) 
Fan, Ma and Dai (2014) 
Fan, Feng and Song (2011) 
Zhu et al. (2011) 

Li, Zhong and Zhu (2012) 
Chang, Tang and Wu (2013a) 


97 

93 

96 

96 

84 

85 

82 

89 

92 

98 

88 

98 

93 

95 

97 

99 

13 

8 

9 

11 

3 

6 

4 

5 

4 

6 

8 

9 

4 

2 

4 

3 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

100 

16 

13 

11 

15 

5 

9 

3 

6 

8 

15 

7 

7 

2 

7 

1 

3 


explanatory variables in a general model setting. We show that our method 
is broadly applicable in a wide class of nonparametric and semiparametric 
models for high-dimensional data analysis. Theory and numerical examples 
show that our approach works promisingly. When the minimal signal is weak 
or the collinearity level among the explanatory variables is high, indepen¬ 
dence feature screening methods will face substantial difficulty. How to solve 
the variable selection problem under such a scenario remains open, and we 
hope to work along this direction with the marginal empirical likelihood 
approach. 

Our method is based on the empirical likelihood, and thus necessarily in¬ 
herits its intensive computation. Fortunately, the marginal screening meth¬ 
ods are highly scalable by exploring the response variable’s dependence on 
each individual predictor at a time. Consequently, they are naturally suited 
for parallel computing. With parallel computing, the computational inten¬ 
siveness issue of our new method can be alleviated significantly, making it 
a practically appealing candidate method. 

Acknowledgements. We thank the Co-Editor, the Associate Editor and 
three referees for very constructive comments and suggestions which have 
improved the presentation of the paper. 
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SUPPLEMENTARY MATERIAL 

Supplement to “Local independence feature screening for nonparamet- 
ric and semiparametric models by marginal empirical likelihood” (DOI: 
10.1214/15-AOS1374SUPP; .pdf). This supplement contains a real data 
analysis and all technical proofs. 
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